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In many modern applications, including analysis of gene expres- 
sion and text documents, the data are noisy, high-dimensional, and 
unordered — with no particular meaning to the given order of the vari- 
ables. Yet, successful learning is often possible due to sparsity: the 
fact that the data are typically redundant with underlying structures 
that can be represented by only a few features. In this paper we 
present treelets — a novel construction of multi-scale bases that ex- 
tends wavelets to nonsmooth signals. The method is fully adaptive, 
as it returns a hierarchical tree and an orthonormal basis which both 
reflect the internal structure of the data. Treelets are especially well- 
suited as a dimensionality reduction and feature selection tool prior 
to regression and classification, in situations where sample sizes are 
small and the data are sparse with unknown groupings of correlated 
or coUinear variables. The method is also simple to implement and 
analyze theoretically. Here we describe a variety of situations where 
treelets perform better than principal component analysis, as well as 
some common variable selection and cluster averaging schemes. We 
illustrate treelets on a blocked covariance model and on several data 
sets (hyperspectral image data, DNA microarray data, and internet 
advertisements) with highly complex dependencies between variables. 

1. Introduction. For many modern data sets (e.g., DNA microarrays, 
financial and consumer data, text documents and internet web pages), tlie 
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collected data are high- dimensional, noisy, and unordered, with no particu- 
lar meaning to the given order of the variables. In this paper we introduce 
a new methodology for the analysis of such data. We describe the theo- 
retical properties of the method, and illustrate the proposed algorithm on 
hyperspectral image data, internet advertisements, and DNA microarray 
data. These data sets contain structure in the form of complex groupings 
of correlated variables. For example, the internet data include more than 
a thousand binary variables (various features of an image) and a couple of 
thousand observations (an image in an internet page). Some of the vari- 
ables are exactly linearly related, while others are similar in more subtle 
ways. The DNA microarray data include the expression levels of several 
thousand genes but less than 100 samples (patients). Many sets of genes 
exhibit similar expression patterns across samples. The task in both cases is 
here classification. The results can therefore easily be compared with those 
of other classification algorithms. There is, however, a deeper underlying 
question that motivated our work: Is there a simple general methodology 
that, by construction, captures intrinsic localized structures, and that as a 
consequence improves inference and prediction of noisy, high-dimensional 
data when sample sizes are small? The method should be powerful enough 
to describe complex structures on multiple scales for unordered data, yet be 
simple enough to understand and analyze theoretically. Below we give some 
more background to this problem. 

The key property that allows successful inference and prediction in high- 
dimensional settings is the notion of sparsity. Generally speaking, there are 
two main notions of sparsity. The first is sparsity of various quantities related 
either to the learning problem at hand or to the representation of the data in 
the original given variables. Examples include a sparse regression or classifi- 
cation vector [Tibshirani (1996)], and a sparse structure to the covariance or 
inverse covariance matrix of the given variables [Bickel and Levina (2008)]. 
The second notion is sparsity of the data themselves. Here we are referring 
to a situation where the data, despite their apparent high dimensionality, 
are highly redundant with underlying structures that can be represented 
by only a few features. Examples include data where many variables are 
approximately collinear or highly related, and data that lie on a nonlin- 
ear manifold [Belkin and Niyogi (2005), Coifman et al. (2005)].^ While the 
two notions of sparsity are different, they are clearly related. In fact, a low 
intrinsic dimensionality of the data typically implies, for example, sparse 
regression or classification vectors, as well as low-rank covariance matrices. 
However, this relation may not be directly transparent, as the sparsity of 



^A referee pointed out that another issue with sparsity is that very high-dimensional 
spaces have very simple structure [Hall, Marron and Neeman (2005), Murtagh (2004), 
Ahn and Marron (2008)]. 
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these quantities sometimes becomes evident only in a different basis repre- 
sentation of the data. 

In either case, to take advantage of sparsity, one constrains the set of possi- 
ble parameters of the problem. For the first kind of sparsity, two key tools are 
graphical models [Whittaker (2001)] that assume statistical dependence be- 
tween specific variables, and regularization methods that penalize nonsparse 
solutions [Hastie, Tibshirani and Friedman (2001)]. Examples of such regu- 
larization methods are the lasso [Tibshirani (1996)], regularized covariance 
estimation methods [Bickel and Levina (2008), Levina and Zhu (2007)] and 
variable selection in high-dimensional graphs [Meinshausen and Biihlmann 
(2006)]. For the second type of sparsity, where the goal is to find a new 
set of coordinates or features of the data, two standard "variable trans- 
formation" methods are principal component analysis [Jolliffe (2002)] and 
wavelets [Ogden (1997)]. Each of these two methods has its own strengths 
and weaknesses which we briefiy discuss here. 

PCA has gained much popularity due to its simplicity and the unique 
property of providing a sequence of best linear approximations in a least 
squares sense. The method has two main limitations. First, PCA computes 
a global representation, where each basis vector is a linear combination of 
all the original variables. This makes it difficult to interpret the results and 
detect internal localized structures in the data. For example, in gene expres- 
sion data, it may be difficult to detect small subsets of highly correlated 
genes. The second limitation is that PCA constructs an optimal linear rep- 
resentation of the noisy observations, but not necessarily of the (unknown) 
underlying noiseless data. When the number of variables p is much larger 
than the number of observations re, the true underlying principal factors 
may be masked by the noise, yielding an inconsistent estimator in the joint 
limit p, re — > oo,p/n — > c [Johnstone and Lu (2008)]. Even for a finite sample 
size re, this property of PCA and other global methods (such as partial least 
squares and ridge regression) can lead to large prediction errors in regres- 
sion and classification [Buckheit and Donoho (1995), Nadler and Coifman 
(2005b)]. Equation (25) in our paper, for example, gives an estimate of the 
finite-n regression error for a linear mixture error-in-variables model. 

In contrast to PCA, wavelet methods describe the data in terms of lo- 
calized basis functions. The representations are multi-scale, and for smooth 
data, also sparse [Donoho and Johnstone (1995)]. Wavelets are used in many 
nonparametric statistics tasks, including regression and density estimation. 
In recent years wavelet expansions have also been combined with regulariza- 
tion methods to find regression vectors which are sparse in an a priori known 
wavelet basis [Candes and Tao (2007), Donoho and Elad (2003)]. The main 
limitation of wavelets is the implicit assumption of smoothness of the (noise- 
less) data as a function of its variables. In other words, standard wavelets 
are not suited for the analysis of unordered data. Thus, some work suggests 
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first sorting the data, and then applying fixed wavelets to the reordered 
data [Murtagh, Starck and Berry (2000), Murtagh (2007)]. 

In this paper we propose an adaptive method for multi-scale representa- 
tion and eigenanalysis of data where the variables can occur in any given 
order. We call the construction treelets, as the method is inspired by both 
hierarchical clustering trees and wavelets. The motivation for the treelets is 
two-fold: One goal is to find a "natural" system of coordinates that reflects 
the underlying internal structure of the data and that is robust to noise. 
A second goal is to improve the performance of conventional regression and 
classification techniques in the "large p, small n" regime by finding a reduced 
representation of the data prior to learning. We pay special attention to 
sparsity in the form of groupings of similar variables. Such low-dimensional 
structure naturally occurs in many data sets; for example, in DNA microar- 
ray data where genes sharing the same pathway can exhibit highly correlated 
expression patterns, and in the measured spectra of a chemical compound 
that is a linear mixture of certain simpler substances. Collinearity of vari- 
ables is often a problem for a range of existing dimensionality reduction 
techniques — including least squares, and variable selection methods that do 
not take variable groupings into account. 

The implementation of the treelet transform is similar to to the classical 
Jacobi method from numerical linear algebra [Golub and van Loan (1996)]. 
In our work we construct a data-driven multi-scale basis by applying a series 
of Jacobi rotations (PCA in two dimensions) to pairs of correlated variables. 
The final computed basis functions are orthogonal and supported on nested 
clusters in a hierarchical tree. As in standard PCA, we explore the covariance 
structure of the data but — unlike PCA — the analysis is local and multi-scale. 
As shown in Section 3.2.2 the treelet transform also has faster convergence 
properties than PCA. It is therefore more suitable as a feature extraction 
tool when sample sizes are small. 

Other methods also relate to treelets. In recent years hierarchical clus- 
tering methods have been widely used for identifying diseases and groups 
of co-expressed genes [Eisen et al. (1998), Tibshirani et al. (1999)]. Many 
researchers are also developing algorithms that combine gene selection and 
gene grouping; see, for example, Hastie et al. (2001), Dettling and Biihlmann 
(2004), Zou and Hastie (2005) among others, and see Fraley and Raftery 
(2002) for a review of model-based clustering. 

The novelty and contribution of our approach is the simultaneous con- 
struction of a data-driven multi-scale orthogonal basis and a hierarchical 
cluster tree. The introduction of a basis enables application of the well- 
developed machinery of orthonormal expansions, wavelets, and wavelet pack- 
ets for nonparametric smoothing, data compression, and analysis of general 
unordered data. As with any orthonormal expansion, the expansion coeffi- 
cients refiect the effective dimension of the data, as well as the significance 
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of each coordinate. In our case, we even go one step further: The basis func- 
tions themselves contain information on the geometry of the data, while the 
corresponding expansion coefficients indicate their importance; see examples 
in Sections 4 and 5. 

The treelet algorithm has some similarities to the local Karhunen-Loeve 
Basis for smooth ordered data by Coifman and Saito (1996), where the basis 
functions are data-driven but the tree structure is fixed. Our paper is also 
related to recent independent work on the Haar wavelet transform of a den- 
drogram by Murtagh (2007). The latter paper also suggests basis functions 
on a data-driven cluster tree but uses fixed wavelets on a pre-computed den- 
drogram. The treelet algorithm offers the advantages of both approaches, as 
it incorporates adaptive basis functions as well as a data-driven tree struc- 
ture. As shown in this paper, this unifying property turns out to be of 
key importance for statistical inference and prediction: The adaptive tree 
structure allows analysis of unordered data. The adaptive treelet functions 
lead to results that reflect the internal localized structure of the data, and 
that are stable to noise. In particular, when the data contain subsets of 
co-varying variables, the computed basis is sparse, with the dominant basis 
functions effectively serving as indicator functions of the hidden groups. For 
more complex structure, as illustrated on real data sets, our method returns 
"softer," continuous-valued loading functions. In classification problems, the 
treelet functions with the most discriminant power often compute differences 
between groups of variables. 

The organization of the paper is as follows: In Section 2 we describe the 
treelet algorithm. In Section 3 we examine its theoretical properties. The 
analysis includes the general large-sample properties of treelets, as well as 
a specific covariance model with block structure. In Section 4 we discuss 
the performance of the treelet method on a linear mixture error-in-variable 
model and give a few illustrative examples of its use in data representation 
and regression. Finally, in Section 5 we apply our method to classification 
of hyperspectral data, internet advertisements, and gene expression arrays. 

A preliminary version of this paper was presented at AISTATS-07 
[Lee and Nadler (2007)]. 

2. The treelet transform. In many modern data sets the data are not 
only high-dimensional but also redundant with many variables related to 
each other. Hierarchical clustering algorithms [Jain, Murty and Flynn (1999), 
Xu and Wunsch (2005)] are often used for the organization and grouping of 
the variables of such data sets. These methods offer an easily interpretable 
description of the data structure in terms of a dendrogram, and only require 
the user to specify a measure of similarity between groups of observations or 
variables. So called agglomerative hierarchical methods start at the bottom 
of the tree and, at each level, merge the two groups with highest inter-group 
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similarity into one larger cluster. The novelty of the proposed treelet al- 
gorithm is in constructing not only clusters or groupings of variables, but 
also functions on the data. More specifically, we construct a multi-scale or- 
thonormal basis on a hierarchical tree. As in standard multi-resolution analy- 
sis [Mallat (1998)], the treelet algorithm provides a set of "scaling functions" 
defined on nested subspaces Vq D Vi D • • O Vl , and a set of orthogonal "de- 
tail functions" defined on residual spaces {Wi}^^^, where © We = V^-i- 
The treelet decomposition scheme represents a multi-resolution transform, 
but technically speaking, not a wavelet transform. (In terms of the tiling of 
"time-frequency" space, the method is more similar to local cosine trans- 
forms, which divide the time axis in intervals of varying sizes.) The details 
of the treelet algorithm are in Section 2.1. 

In this paper we measure the similarity Mij between two variables Si and 
Sj with the correlation coefficient 



where Sjj = E[(sj — Esj)(sj — Esj)] is the usual covariance. Other information- 
theoretic or graph-theoretic similarity measures are also possible. For some 
applications, one may want to use absolute values of correlation coefficients, 
or a weighted sum of covariances and correlations as in Mij = \pij \ + A|Ilij|, 
where the parameter A is a nonnegative number. 

2.1. The algorithm: Jacobi rotations on pairs of similar variables. The 
treelet algorithm is inspired by the classical Jacobi method for computing 
eigenvalues of a matrix [Golub and van Loan (1996)]. There are also some 
similarities with the Grand Tour [Asimov (1985)], a visualization tool for 
viewing multidimensional data through a sequence of orthogonal projections. 
The main difference from Jacobi's method — and the reason why the treelet 
transform, in general, returns an orthonormal basis different from standard 
PCA — is that treelets are constructed on a hierarchical tree. 

The idea is simple. At each level of the tree, we group together the most 
similar variables and replace them by a coarse-grained "sum variable" and 
a residual "difference variable." The new variables are computed by a lo- 
cal PCA (or Jacobi rotation) in two dimensions. Unlike Jacobi's original 
method, difference variables are stored, and only sum variables are processed 
at higher levels of the tree. Hence, the multi-resolution analysis (MRA) in- 
terpretation. The details of the algorithm follows: 

• At level i = (the bottom of the tree), each observation or "signal" x is 
represented by the original variables x^*^) = [so,i, . . . ,so,p]"^) where so,fc = 
Xk- Associate to these coordinates, the Dirac basis, Bq = [(/>o,i, </'o,2i • • • 1 4>o,p]^ 
where Bq is the pxp identity matrix. Compute the sample covariance and 
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similarity matrices T,^'^^ and M^^\ Initialize the set of "sum variables," 
S = {l,2,...,p}. 
Repeat for ^ = 1, . . . , L: 



1. Find the two most similar sum variables according to the similarity 
matrix M^^"^). Let 



(2) 



: arg max M ■ ■ , 

j JG-S 



where i < j, and maximization is only over pairs of sum variables that 
belong to the set S. As in standard wavelet analysis, difference variables 
(defined in step 3) are not processed. 
2. Perform a local PCA on this pair. Find a Jacobi rotation matrix 



(3) 



J{a,P, 



1 



























1 



where c = cos(0^) and s = sin(0^), that decorrelates and x/s; more 
specifically, find a rotation angle 6i such that \6i\ < 7r/4 and = 
0, where S^^-* = J^S^^"-*^) J. This transformation corresponds to 



a change of basis = B£_iJ, and new coordinates x^^^ = J^x^^ ^\ 
Update the similarity matrix M^^^ accordingly. 
3. Multi-resolution analysis. For ease of notation, assume that T,aa > 
after the Jacobi rotation, where the indices a and /? correspond to the 
first and second principal components, respectively. Define the sum 



and df 



'13 



. Similarly, 



and difference variables at level £ as S£ 

define the scaling and detail functions (f>i and ipi as columns a and /3 
of the basis matrix B^. Remove the difference variable from the set of 
sum variables, S = S \ {/?}. At level £, we have the orthonormal treelet 
decomposition 



(4) 



i=l 



1=1 



where the new set of scaling vectors {(t>i,i}^=i is the union of the vector 
and the scaling vectors {4>£-i,j}j^a,f3 from the previous level, and 
the new coarse-grained sum variables {s£^i}f~^ are the projections of 
the original data onto these vectors. As in standard multi-resolution 
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analysis, the first sum is the coarse-grained representation of the signal, 
while the second sum captures the residuals at different scales. 

The output of the algorithm can be summarized in terms of a hierarchi- 
cal tree with a height L <p — 1 and an ordered set of rotations and pairs of 
indices, {{Oe,Gie, l3e)}i=i- Figure 1 (left) shows an example of a treelet con- 
struction for a "signal" of length p = 5, with the data representations x^^^ at 
the different levels of the tree shown on the right. The s-components (pro- 
jections in the main principal directions) represent coarse-grained "sums." 
We associate these variables to the nodes in the cluster tree. Similarly, the 
d-components (projections in the orthogonal directions) represent "differ- 
ences" between node representations at two consecutive levels in the tree. 
For example, in the figure diV'i = (so,ii;^o,i + ■so,2';^o,2) — sicpi^i. 

We now briefly consider the complexity of the treelet algorithm on a gen- 
eral data set with n observations and p variables. For a naive implementa- 
tion with an exhaustive search for the optimal pair {a, (3) in Equation 2, the 
overall complexity is m + 0{Lp'^) operations, where m = 0{mm{np'^ ,pn'^)) 
is the cost of computing the sample covariance matrix by singular value 
decomposition, and L is the height of the tree. However, by storing the 
similarity matrices S^'^^ and M^'^^ and keeping track of their local changes, 
the complexity can be further reduced to m -|- 0{Lp). In other words, the 
computational cost is comparable to hierarchical clustering algorithms. 

2.2. Selecting the height L of the tree and a "best K-basis." The default 
choice of the treelet transform is a maximum height tree with L = p — 1 ; 

Si ri, d'j d, li-i r 
(/] d-A .^2 (ii 1'^ 

■''1 '^1 No.:! ■■^[1,-1 fu.fl]'''" 
«n,i snj .'*f),3 .snj -^n.sF 

Fig. 1. (Left) A toy example of a hierarchical tree for data of dimension p = 5. At 
£ = 0, the signal is represented by the original p variables. At each successive level 
£ = 1,2, . . . ,p — 1, the two most similar sum variables are combined and replaced by the 
sum and difference variables Si,dt corresponding to the first and second local principal 
components. (Right) Signal representation x^*^ at different levels. The s- and d-coordinates 
represent projections along scaling and detail functions in a multi-scale treelet decomposi- 
tion. Each such representation is associated with an orthogonal basis in V that captures 
the local eigenstructure of the data. 
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see examples in Sections 5.1 and 5.3. This choice leads to a fully parameter- 
free decomposition of the data and is also faithful to the idea of a multi- 
resolution analysis. For more complexity, one can alternatively also choose 
any of the orthonormal (ON) bases at levels £ <p — 1 of the tree. The data 
are then represented by coarse-grained sum variables for a set of clusters in 
the tree, and difference variables that describe the finer details. In princi- 
ple, any of the standard techniques in hierarchical clustering can be used 
in deciding when to stop "merging" clusters (e.g., use a preset threshold 
value for the similarity measure, or use hypothesis testing for homogeneity 
of clusters, etc.). In this work we propose a rather different method that 
is inspired by the best basis paradigm [Coifman and Wickerhauser (1992), 
Saito and Coifman (1995)] in wavelet signal processing. This approach di- 
rectly addresses the question of how well one can capture information in the 
data. 

Consider IID data x^, . . . ,x", where x* G is a p-dimensional random 
vector. Denote the candidate ON bases by Bq, . . . , Bp^i, where Bi is the 
basis at level i in the tree. Suppose now that we are interested in finding 
the "best" i^'-dimensional treelet representation for data representation and 
compression, where the dimension K <p has been determined in advance. 
It then makes sense to use a scoring criterion that measures the percent- 
age of explained variance for the chosen coordinates. Thus, we propose the 
following greedy scoring and selection approach: 

For a given orthonormal basis B = (wi, . . . , Wp), assign a normalized en- 
ergy score £■ to each vector Wj according to 

The corresponding sample estimate is <?(w) = — ibc^P"' ^^'^^ vectors 
according to decreasing energy, ^{\) , ■ ■ ■ , ^{p) , and define the score Tk of the 
basis B by summing the K largest terms, that is, let Tk{B) =J2j^i^{'^i)- 
The best K-basis is the treelet basis with the highest score 

(6) Bl = arg max TxiBi). 

Bl : 0<i<p—l 

It is the basis that best compresses the data with only K components. In 
case of degeneracies, we choose the coordinate system with the smallest i. 
Furthermore, to estimate the score Tk for a particular data set, we use 
cross-validation (CV); that is, the treelets are constructed using subsets of 
the original data set and the score is computed on independent test sets 
to avoid overfitting. Both theoretical calculations (Section 3.2) and simu- 
lations (Section 4.1) indicate that an energy-based measure is useful for 
detecting natural groupings of variables in data. Many alternative measures 
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(e.g., Fisher's discriminant score, classification error rates, entropy, and other 
sparsity measures) can also be used. For the classification problem in Sec- 
tion 5.1, for example, we define a discriminant score that measures how well 
a coordinate separates data from different classes. 

3. Theory. 

3.1. Large sample properties of the treelet transform. In this section we 
examine the large sample properties of treelets. We introduce a more general 
definition of consistency that takes into account the fact that the treelet 
operator (based on correlation coefficients) is multi-valued, and study the 
method under the stated conditions. We also describe a bootstrap algorithm 
for quantifying the stability of the algorithm in practical applications. The 
details are as follows. 

First some notation and definitions: Let T(S) = J^TiJ denote the co- 
variance matrix after one step of the treelet algorithm when starting with 
covariance matrix S. Let T^(S) denote the covariance matrix after (. steps 
of the treelet algorithm. Thus, = T o • • ■ oT corresponds to T applied I 
times. Define ||^||oo = maxj^fc \^jk\ and let 

(7) T„(S,<5„)= U r(A). 

||A-S||oo<5„ 

Define 'T^{ll,5n) =7^(S,(5„), and 

(8) Ti(S,<^„)= U r(A), t>2. 

Let S„ denote the sample covariance matrix. We make the following as- 
sumptions: 

(Al) Assume that x has finite variance and satisfies one of the following 
three assumptions: (a) each xj is bounded or (b) x is multivariate normal 
or (c) there exist M and s such that E(|xjXfc|'^) < q\M'^~'^s/2 for all q>2. 
(A2) The dimension p„ satisfies pn < for some c> 0. 

Theorem 1. Suppose that (Al) and (A2) hold. Let 5n = KyJ\ogn/n, 
where K > 2c. Then, as n,pn^ oo, 

(9) P(T^(Sn) G Ti(S, 6n),i=h...,Pn)^l. 

Some discussion is in order. The result says that T^(T,n) is not too far 
from T^(A) for some A close to S. It would perhaps be more satisfying to 
have a result that says that ||T^(S) — T^(S)||oo converges to 0. This would 
be possible if one used covariances to measure similarity, but not in the case 
of correlation coefficients. 



TREELETS 



11 



For example, it is easy to construct a covariance matrix S with the fol- 
lowing properties: 

1. pi2 is the largest ofF-diagonal correlation, 

2. /934 is nearly equal to pi2-, 

3. the 2x2 submatrix of S corresponding to xi and X2 is very different 
than the 2x2 submatrix of S corresponding to 3:3 and x/^. 

In this case there is a nontrivial probability that /J34 > f)i2 due to sample 
fluctuations. Therefore T(S) performs a rotation on the first two coordi- 
nates, while r(S) performs a rotation on the third and fourth coordinates. 
Since the two corresponding submatrices are quite different, the two rota- 
tions will be quite different. Hence, r(S) can be quite different from T(S). 
This does not pose any problem since inferring r(S) is not the goal. Under 
the stated conditions, we would consider both T'(E) and r(S) to be rea- 
sonable transformations. We examine the details and include the proof of 
Theorem 1 in Appendix A.l. 

Because T'(Si) and T{Yi2) can be quite different even when the ma- 
trices Si and S2 are close, it might be of interest to study the stabil- 
ity of T(S„). This can be done using the bootstrap. Construct B boot- 
strap replications of the data and corresponding sample covariance matrices 
S* 1, . . . , S* ^. Let (5„ = J~^(l — a), where Jn is the empirical distribution 

function of ^ — S„||oo, ^ = 1, • . • , B} and a is the confidence level. If F 
has finite fourth moments and p is fixed, then it follows from Corollary 1 
of Beran and Srivastava (1985) that 

lim PiT'fS G C„) = 1 - a, 

71— >00 

where C,i = {A : ||A — Sn||oo < ^n]- Let 

An = {T[K):KeCn}. 

It follows that P(r(S) G A„) ^ 1 - a. The set A n can be approximated by 
applying T to all S* for which US* ^ — S„||oo < Sn- In Section 4.1 (Figure 3) 
we use the bootstrap method to estimate confidence sets for treelets. 

3.2. Treelets on covariance matrices with block structures. 

3.2.1. An exact analysis in the limit n— >oo. Many real life data sets, 
including gene arrays, consumer data sets, and word-documents, display co- 
variance matrices with approximate block structures. The treelet transform 
is especially well suited for representing and analyzing such data — even for 
noisy data and small sample sizes. 

Here we show that treelets provide a sparse representation when covari- 
ance matrices have inherent block structures, and that the loading functions 
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themselves contain information about the inherent groupings. We consider 
an ideal situation where variables within the same group are collinear, and 
variables from different groups are weakly correlated. All calculations are 
exact and computed in the limit of the sample size n ^ oo. An analysis of 
convergence rates later appears in Section 3.2.2. 

We begin by analyzing treelets on p random variables that are indistin- 
guishable with respect to their second-order statistics. We show that the 
treelet algorithm returns scaling functions that are constant on groups of 
indistinguishable variables. In particular, the scaling function on the full set 
of variables in a block is a constant function. Effectively, this function serves 
as an indicator function of a (sometimes hidden) set of similar variables in 
data. These results, as well as the follow-up main results in Theorem 2 and 
Corollary 1, are due to the fully adaptive nature of the treelet algorithm — 
a property that sets treelets apart from methods that use fixed wavelets 
on a dendrogram [Murtagh (2007)], or adaptive basis functions on fixed 
trees [Coifman and Saito (1996)]; see Remark 2 for a concrete example. 

Lemma 1. Assume that x = (xi, X2, . . . , Xp)"^ is a random vector with 
distribution F, mean 0, and covariance matrix T, = a\lpxp, where Ipxp de- 
notes a p X p matrix with all entries equal to 1. Then, at any level 1<1< 
p — 1 of the tree, the treelet operator ( defined in Section 3.1) returns (for 
the population covariance matrix Ti) an orthogonal decomposition 

p-i e 

(10) T\^) = Y,s,,^<Pe,^+Y,d^^l^i, 

1=1 i=l 

with sum variables Sf^i = ^ J2jeAe i -^j '^^^ scaling functions (f>£^i = ^ x 

Isf. -, which are defined on disjoint index subsets Ai^i C {1, . . . ,p} (i = l,. . . ,p — 

£) with lengths \Ae^i\ and J2^Zi\-^e,i\ =P- The expansion coefficients have 
variances Y{s£^i} = \ Ae^i\cr1, and Y{di} = 0. In particular, for i = p — 1, 

(11) TP-\^) = s^+Y,di^i, 

1=1 

where s = -^(xi H \- Xp) and cp = ^[l . . .1]"^ . 

Remark 1. Uncorrelated additive noise in (xi, X2, . . . , Xp) adds a diago- 
nal perturbation to the 2x2 covariance matrices T,^^\ which are computed 
at each level in the tree [see (35)]. Such noise may affect the order in which 
variables are grouped, but the asymptotic results of the lemma remain the 
same. 
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Remark 2. The treelet algorithm is robust to noise because it com- 
putes data-driven rotations on variables. On the other hand, methods that 
use fixed transformations on pre-computed trees are often highly sensitive to 
noise, yielding inconsistent results. Consider, for example, a set of four sta- 
tistically indistinguishable variables {xi,X2,X3,X4}, and compare treelets to 
a Haar wavelet transform on a data-driven dendrogram [Murtagh (2004)]. 
The two methods return the same results if the variables are merged in 
the order {{a^i, X2}, {xa, X4}}; that is, s = ^{xi + 2:2 + ^3 -|- x^) and (j) = 
^[1,1,1,1]"^. Now, a different realization of the noise may lead to the or- 
der {{{xi,X2},X4},X3}. A fixed rotation angle of 7r/4 (as in Haar wavelets) 
would then return the sum variable s^^'^^ = + ^2) + X4) + X3) 

and scaling function (f)^^"^ = ^, ^f. 

Next we consider data where the covariance matrix is & K x K block 
matrix with white noise added to the original variables. The following main 
result states that, if variables from different blocks are weakly correlated and 
the noise level is relatively small, then the K maximum variance scaling 
functions are constant on each block (see Figure 2 in Section 4 for an exam- 
ple). We make this precise by giving a sufficient condition [equation (13)] in 
terms of the noise level, and within-block and between-block correlations of 
the original data. For illustrative purposes, we have reordered the variables. 
A pxp identity matrix is denoted by Ip, and apiX pj matrix with all entries 
equal to 1 is denoted by Ip^xpj- 



Theorem 2. Assume that x = {xi,X2, ■ ■ ■ ,Xp)'^ is a random vector with 

P' 



distribution F , mean and covariance matrix S = C + cr^/p, where <t^ rep 



resents the variance of white noise in each variable and 
(12) C 



I Cii C12 . . . C\K \ 

C12 C22 ■ ■ ■ C2K 



\CiK C2K ■■■ Ckk J 
is a K X K block matrix with "within-block'' covariance matrices C^k = 
o'lXp^xpk (k= 1,. . . ,K) and "between-block" covariance matrices Cij = CTijlp-xpj 
(i,j = l,...,K;i^j). If 

(13) max ( < ^ 



where 5 = — , then the treelet decomposition at level i = p — K has the 



i<i,j<K\aiaj J yrT3max(F7F)' 

6 = — - — 

mint trt ' 

form 

K p-K 

(14) TP-^(S) = ^Sfe</.fe+ 

k=l i=l 
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where Sk = -7^J2-iPFi, ^i, (kk = —j^Ibi., o-i^d Bk represents the set of indices 
of variables in block k (k = 1, . . . , K ). The expansion coefficients have means 
E{sfc} = Ejdj} = 0, and variances V{sfc} =P/tO"| + o"^ and Y{di} = O(cj^), 
for i = 1, . . . ,p — K . 

Note that if the conditions of the theorem are satisfied, then all treelets 
{both scaling and difference functions) associated with levels i> p — K are 
constant on groups of similar variables. In particular, for a full decomposition 
at the maximum level i = p — 1 of the tree we have the following key result, 
which follows directly from Theorem 2: 

Corollary 1. Assume that the conditions in Theorem 2 are satisfied. 
A full treelet decomposition then gives r?'-i(S) = scP + E^Zld^^pi, where 
the scaling function (j) and the K — 1 detail functions ipp-x+i, ■ ■ ■ , V'p-i ^''"^ 
constant on each of the K blocks. The coefficients s and dp-x+i, ■ ■ ■ ,dp-i 
reflect between-block structures, as opposed to the coefficients di, . . . ,dp-K 
which only reflect noise in the data with variances Y{di} = 0{a'^) for i = 
l,...,p-K. 

The last result is interesting. It indicates a parameter-free way of finding 
K, the number of blocks, namely, by studying the energy distribution of a 
full treelet decomposition. Furthermore, the treelet transform can uncover 
the block structure even if it is hidden amidst a large number of background 
noise variables (see Figure 3 for a simulation with finite sample size). 

Remark 3. Both Theorem 2 and Corollary 1 can be directly general- 
ized to include po uncorrelated noise variables, so that x= {xi, . . . ,Xp-pg, 
Xp-pg^i, . . . , Xp)'^ , where E(xi) = and ]K{xiXj) = for i> p — pQ and j i. 
For example, if equation (13) is satisfied, then the treelet decomposition at 
level i = p — Po is 

K p-po-K 
TP~P"{^) = Y,Sk^k+ E + {0,---,0,Xp-po+i,...,Xpf. 

k=l i=l 

Furthermore, note that according to equation (41) in the Appendix A. 3, 
within-block correlations are smallest ("worst-case scenario") when single- 
tons are merged. Thus, the treelet transform is a stabilizing algorithm; once 
a few correct coarse-grained variables have been computed, it has the effect 
of denoising the data. 

3.2.2. Convergence rates. The aim of this section is to give a rough 
estimate of the sample size required for treelets to discover the inherent 
structures of data. For covariance matrices with block structures, we show 
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that treelets find the correct groupings of variables if the sample size n ^ 
0{\ogp), where p is the dimension of the data. This is a significant re- 
sult, as standard PC A — on the other hand — is consistent if and only if 
p/n ^ [Johnstone and Lu (2008)], that is, when n » 0{p). The result 
is also comparable to that in Bickel and Levina (2008) for regularization 
of sparse nearly diagonal covariance matrices. One main difference is that 
their paper assumes an a priori known ordered set of variables in which the 
covariance matrix is sparse, whereas treelets find such an ordering and co- 
ordinate system as part of the algorithm. The argument for treelets and a 
block covariance model goes as follows. 

Assume that there are K blocks in the population covariance matrix S. 
Define n as the event that the K maximum variance treelets, constructed 
at level L = p — K oi the tree, for a data set with n observations, are sup- 
ported only on variables from the same block. In other words, let -A/^ rep- 
resent the ideal case where the treelet transform finds the exact groupings 
of variables. Let Ei denote the event that at level i of the tree, the largest 
between-block sample correlation is less than the smallest within-block sam- 
ple correlation, 

Ei = jmaxp^' < mmp'^}. 
According to equations (31)-(32), the corresponding population correlations 



where 6 = — , for all i. Thus, a sufficient condition for Ef is that {max Ior — 

mirifc (Tfc ' ' I I \r ts 

p'bI < n {max — ply I < t} , where t = {p2 — pi)/2 > 0. We have that 






1 





If (Al) holds, then it follows from Lemma 3 that 



mZn)< E (P(max|p^'^-p^'^|>t)+P(max|p;j 




o<e<L 
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From the large-sample properties of treelets (Section 3.1), it follows that 
treelets are consistent if n ^ O(logp). 

4. Treelets and a linear error-in- variables mixture model. In this section 
we study a simple error-in-variables linear mixture model (factor model) 
which, under some conditions, gives rise to covariance matrices with block 
structures. Under this model, we compare treelets with PCA and variable 
selection methods. An advantage of introducing a concrete generative model 
is that we can easily relate our results to the underlying structures or compo- 
nents of real data; for example, different chemical substances in spectroscopy 
data, genes from the same pathway in microarray data, etc. 

In light of this, consider a linear mixture model with K components and 
additive noise. Each multivariate observation x G RP has the form 

K 

(15) X = ^ "UjVj + az. 

j=i 

The components or "factors" uj are random (but not necessarily indepen- 
dent) variables with variances (t|. The "loading vectors" Vj are fixed, but 
typically unknown linearly independent vectors. In the last term, a repre- 
sents the noise level, and z A/'p(0,/) is a p-dimensional random vector. 

In the unsupervised setting, we are given a training set {xj}"^]^ sampled 
from equation (15). Unsupervised learning tasks include, for example, infer- 
ence on the number of components K, and on the underlying vectors Vj. In 
the supervised setting, we consider a data set {xj, yj}"^;^, where the response 
value y of an observation x is a linear combination of the variables Uj with 
a random noise term e, 

K 

(16) y = ^ajUj+e. 

The standard supervised learning task in regression and classification is pre- 
diction of y for new data x, given a training set {xj,yj}"^^. 

Linear mixture models are common in many fields, including spectroscopy 
and gene expression analysis. In spectroscopy equation (15) is known as 
Beer's law, where x is the logarithmic absorbance spectrum of a chemical 
substance measured at p wavelengths, Uj are the concentrations of con- 
stituents with pure absorbance spectra Vj, and the response y is typically 
one of the components, y = Ui. In gene data x is the measured expression 
level of p genes, Uj are intrinsic activities of various pathways, and each 
vector Vj represents the set of genes in a pathway. The quantity y is typi- 
cally some measure of severity of a disease, such as time until recurrence of 
cancer. A linear relation between y and the values of Uj, as in equation (16), 
is commonly assumed. 
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4.1. Treelets and a linear mixture model in the unsupervised setting. Con- 
sider data {xj}-L]^ from the model in equation (15). Here we analyze an il- 
lustrative example with = 3 components and loading vectors = I{Bk), 
where / is the indicator function, and C {1,2, ... ,p} are sets of variables 
with sizes pk = \Bk\ (k = 1, 2, 3). A more general analysis is possible but may 
not provide more insight. 

The unsupervised task is to uncover the internal structure of the linear 
mixture model from data, for example, to infer the unknown structure of the 
vectors , including the sizes pk of the sets Bk . The difficulty of this problem 
depends, among other things, on possible correlations between the random 
variables uj, the variances of the components uj, and interferences (overlap) 
between the loading vectors v^. We present three examples with increasing 
difficulty. Standard methods, such as principal component analysis, succeed 
only in the simplest case (Example 1), whereas more sophisticated methods, 
such as sparse PC A (elastic nets), sometimes require oracle information to 
correctly fit tuning parameters in the model. The treelet transform seems 
to perform well in all three cases. Moreover, the results are easy to explain 
by computing the covariance matrix of the data. 

Example 1 (Uncorrelated factors and nonoverlapping loading vectors). 
The simplest case is when the random variables uj are all uncorrelated for 
j = 1,2,3, and the loading vectors vj are nonoverlapping. The population 
covariance matrix of x is then given hy T, = C + cr'^Ip, where the noise- free 
matrix 

/Cu 0\ 
C22 ' 
C33 

V 0/ 

is a 4 X 4 block matrix with the first three blocks C^k = '^k^Pk>^Pk ~ ^' ^' 
and the last diagonal block having all entries equal to zero. 

Assume that ak ^ cr for /c = 1,2,3. This is a specific example of a spiked 
covariance model [Johnstone (2001)] the three components corresponding 
to distinct large eigenvalues or "spikes" of a model with background noise. 
As n — > 00 with p fixed, PCA recovers the hidden vectors vi, V2, and V3, 
since these three vectors exactly coincide with the principal eigenvectors of 
S. A treelet transform with a height L determined by cross-validation and 
a normalized energy criterion returns the same results — which is consistent 
with Section 3.2 (Theorem 2 and Corollary 1). 

The difference between PCA and treelets becomes obvious in the "small 
n, large p regime." In the joint limit p,n ^ 00, standard PCA computes 
consistent estimators of the vectors Vj (in the presence of noise) if and 
only if p{n)/n [Johnstone and Lu (2008)]. For an analysis of PCA for 
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finite p^n, see, for example, Nadler (2007). As described in Section 3.2.2, 
treelets require asymptotically far fewer observations with the condition for 
consistency being \ogp{n)/n — > 0. 

Example 2 (Correlated factors and nonoverlapping loading vectors). If 
the random variables Uj are correlated, treelets are far better than PCA at in- 
ferring the underlying localized structure of the data — even asymptotically. 
Again, this is easy to explain and quantify by studying the data covariance 
structure. For example, assume that the loading vectors vi, V2, and V3 are 
nonoverlapping, but that the corresponding factors are dependent according 
to 



(18) 



ni~iV(0,a?), n2~iV(0,ai), 



The covariance matrix is then given by S : 



li3 = CiUi + 02^2. 

: C + (j'^Ip, where 



(19) 



C 





Ci3 

V 





C23 




Cl3 
C23 
C33 





0\ 




0/ 



(note that o"! 



Ci3 — Cicr;^lpjxp3i and 
,j , the loading vectors of 



with Ckk = 

C23 = C20"2lp2xp3- Due to the correlations between u 
the block model no longer coincide with the principal eigenvectors, and it is 
difficult to extract them with PCA. 

We illustrate this problem by the example in Zou, Hastie and Tibshirani 
(2006). Specifically, let 

(TiTi ooVo 



(20) 



Vl 



V2 



V3 



[0 
[0 



1111 





Bs 

oof. 



where there are p = 10 variables total, and the sets Bj are disjoint with 
Pi=P2 = 4,P3 = 2 variables, respectively. Let af = 290, o"! = 300, ci = —0.3, 
C2 = 0.925, and a = 1. The corresponding variance fig of is 282.8, and the 
covariances of the off-diagonal blocks are (T13 = —87 and (T23 = 277.5. 

The first three PCA vectors for a training set of 1000 samples are shown 
in Figure 2 (left). It is difficult to infer the underlying vectors Vj from 
these results, as ideally, we would detect that, for example, the variables 
(x5, xe, xt,xs) are all related and extract the latent vector V2 from only these 
variables. Simply thresholding the loadings and discarding small values also 
fails to achieve this goal [Zou, Hastie and Tibshirani (2006)]. The example 
illustrates the limitations of a global approach even with an infinite num- 
ber of observations. In Zou, Hastie and Tibshirani (2006) the authors show 
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Fig. 2. In Example 2 PCA fails to find the important variables in the three- component 
mixture model, as the computed eigenvectors (left) are sensitive to correlations between dif- 
ferent components. On the other hand, the three maximum energy treelets (right) uncover 
the underlying data structures. 



by simulation that a combined Li and L2-penalized least squares method, 
which they call sparse PCA or elastic nets, correctly identifies the sets of 
important variables if given "oracle information" on the number of variables 
Pi,P2,P3 in the different blocks. Treelets are similar in spirit to elastic nets 
as both methods tend to group highly correlated variables together. In this 
example the treelet algorithm is able to find both K, the number of compo- 
nents in the mixture model, and the hidden loading vectors v, — without any 
a priori knowledge or parameter tuning. Figure 2 (right) shows results from 
a treelet simulation with a large sample size (n = 1000) and a height L = 7 
of the tree, determined by cross-validation (CV) and an energy criterion. 
The three maximum energy basis vectors correspond exactly to the hidden 
loading vectors in equation (20). 



Example 3 (Uncorrelated factors and overlapping loading vectors). Fi- 
nally, we study a challenging example where the first two loading vectors 
vi and V2 are overlapping, the sample size n is small, and the background 
noise level is high. Let {Bi, . . . ,13^} be disjoint subsets of {1, . . . and let 

(21) Vi=/(5i)+I(^2), V2 = I(fi2)+/(^3), V3 = /(fi4), 

where I{Bk) as before represents the indicator function for subset k (k = 
1, . . . , 4). The population covariance matrix is then given by S = C + cr^/p, 
where the noiseless matrix has the general form 
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with diagonal blocks Cn = crflp^xpi, C22 = (erf + cr|)lp2xp2) C'ss = f^ilpsxps' 
C44 = (T3IP4XP4, and off-diagonal blocks C12 = o-j^pixp2 and C23 = <72lp2xp3- 
Consider a numerical example with n = 100 observations, p = 500 vari- 
ables, and noise level a = 0.5. We choose the same form for the compo- 
nents ni,U2,M3 as in [Bair et al. (2006)], but associate the first two com- 
ponents with overlapping loading vectors vi and V2. Specifically, the com- 
ponents are given by ui = ±0.5 with equal probability, U2 = I{U2 < 0.4), 
and U3 = I{U^ < 0.3), where I{x) is the indicator of x, and Uj are all inde- 
pendent uniform random variables in [0,1]. The corresponding variances are 
af = 0.25, = 0.24, and 1T3 = 0.21. As for the blocks B^, we consider Bi = 
{1, . . . , 10}, ^2 = {11, • • • , 501,^3 = {51, . . . , 100}, and B4 = {201, . . . ,400}. 

Inference in this case is challenging for several different reasons. The sam- 
ple size n <p, the loading vectors vi and V2 are overlapping in the region 
;S2 = {11, . . . , 50}, and the signal-to-noise ratio is low with the variance o"^ of 
the noise essentially being of the same size as the variances <t| of Uj. Further- 
more, the condition in equation (13) is not satisfied even for the population 
covariance matrix. Despite these difficulties, the treelet algorithm is remark- 
ably stable, returning results that by and large correctly identify the internal 
structures of the data. The details are summarized below. 

Figure 3 (top center) shows the energy score of the best ii'-basis at differ- 
ent levels of the tree. We used 5-fold cross-validation; that is, we generated a 
single data set of n = 100 observations, but in each of the 5 computations the 
treelets were constructed on a subset of 80 observations, with 20 observations 
left out for the energy score computation. The five curves as well as their 
average clearly indicate a "knee" at the level L = 300. This is consistent with 
our expectations that the treelet algorithm mainly merges noise variables at 
levels L> | lJ^Sfc|. For a tree with "optimum" height L = 300, as indicated 
by the CV results, we then constructed a treelet basis on the full data set. 
Figure 3 (top right) shows the energy of these treelets sorted according to 
descending energy score. The results indicate that we have two dominant 
treelets, while the remaining treelets have an energy that is either slightly 
higher or of the same order as the variance of the noise. In Figure 3 (bottom 
left) we plot the loadings of the four highest energy treelets. "Treelet 1" 
(red) is approximately constant on the set B4 (the support of V3), "Treelet 
2" (blue) is approximately piecewise constant on blocks Bi, B2, and Bs (the 
support of vi and V2), while the low-energy degenerate treelets 3 (green) 
and 4 (magenta) seem to take differences between variables in the sets Bi, 
B2 , and B3 . Finally, we computed 95% confidence bands of the treelets using 
1000 bootstrap samples and the method described in Section 3.1. Figure 3 
(bottom right) indicate, that the treelet results for the two maximum energy 
treelets are rather stable despite the small sample size and the low signal- 
to-noise ratio. Most of the time the first treelet selects variables from B4, 
and most of the time the second treelet selects variables from B2 and either 
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Bi or Bs or both sets. The low-energy treelets seem to pick up differences 
between blocks Bi, B2, and B3, but the exact order in which they select 
the variables vary from simulation to simulation. As described in the next 
section, for the purpose of regression, the main point is that the linear span 
of the first few highest energy treelets is a good approximation of the span 
of the unknown loading vectors, Spanjvi, . . . , v/^}. 

4.2. The treelet transform as a feature selection scheme prior to regres- 
sion. Knowing some of the basic properties of treelets, we now examine 
a typical regression or classification problem with data {xj,yj}"^]^ given by 
equations (15) and (16). As the data x are noisy, this is an error-in-variables 
type problem. Given a training set, the goal is to construct a linear function 
/ : R^' — > R to predict y = /(x) = r • x + 6 for a new observation x. 

Before considering the performance of treelets and other algorithms in this 
setting, we review some of the properties of the optimal mean-squared error 
(MSE) predictor. For simplicity, we consider the case y = ui in equation 
(16), and denote by Pi:R^^RP the projection operator onto the space 
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Fig. 3. Top left: The vectors vi (blue), V2 (green), V3 (red) in Example 3. Top center: 
The "score" or total energy of K = 3 maximum variance treelets computed at different 
levels of the tree with 5-fold cross-validation; dotted lines represent the five different sim- 
ulations and the solid line the average score. Top right: Energy distribution of the treelet 
basis for the full data set at an "optimal" height L = 300. Bottom left: The four treelets 
with highest energy. Bottom right: 95% confidence bands by bootstrap for the two dominant 
treelets. 
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spanned by the vectors {v2, . . . , v/^}. In this setting the unbiased MSE- 
optimal estimator has a regression vector r = Vy/||vy|p, where Vy = vi — 
PiVi. The vector is the part of the loading vector vi that is unique to 
the response variable y = ui, since the projection of vi onto the span of the 
loading vectors of the other components {u2, ■ ■ ■ ,uk) has been subtracted. 
For example, in the case of only two components, we have that 

(23) = vi - -||— p-V2. 

The vector Vy plays a central role in chemometrics, where it is known as the 
net analyte signal [Lorber, Faber and Kowalski (1997), Nadler and Coifman 
(2005a)]. Using this vector for regression yields a mean squared error of 
prediction 

(24) niy-yf} 



V,, 



12 ■ 



'y\ 

We remark that, similar to shrinkage in point estimation, there exist biased 
estimators with smaller MSE [Gruber (1998), Nadler and Coifman (2005b)], 
but for large signal to noise ratios (o"/||vy|| <C 1), such shrinkage is negligible. 

Many regression methods [including multivariate least squares, partial 
least squares (PLS), principal component regression (PGR), etc.] attempt 
to compute the optimal regression vector or net analyte signal (NAS). It can 
be shown that in the limit n ^ oo, both PLS and PGR are MSE-optimal. 
However, in some applications, the number of variables is much larger than 
the number of observations {p ^ n). The question at hand is then, what the 
effect of small sample size is on these methods, when combined with noisy 
high-dimensional data. Both PLS and PGR first perform a global dimen- 
sionality reduction from ptok variables, and then apply least squares linear 
regression on these k features. As described in Nadler and Goifman (2005b), 
their main limitation is that in the presence of noisy high dimensional data, 
the computed projections are noisy themselves. For fixed p and n, a Taylor 
expansion of the regression coefficient as a function of the noise level a shows 
that these methods have an averaged prediction error 



(25) niy-yf} 



Ci C2 p2 



In equation (25) the coefficients ci and C2 are both 0(1) constants, indepen- 
dent of a, p, and n. The quantity /i depends on the specific algorithm used, 
and is a measure of the variances and covariances of the different compo- 
nents Uj , and of the amount of interferences of their loading vectors Vj . The 
key point of this analysis is that when p^n, the last term in (25) can dom- 
inate and lead to large prediction errors. This emphasizes the limitations of 
global dimensionality reduction methods, and the need for robust feature 
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selection and dimensionality reduction of the data prior to application of 
learning algorithms such as PGR and PLS. 

Other common approaches to dimensionality reduction in this setting are 
variable selection schemes, specifically those that choose a small subset of 
variables based on their individual correlation with the response y. To ana- 
lyze their performance, we consider a more general dimensionality reduction 
transformation r:M'P ^ M*^ defined by k orthonormal projections Wj G M^, 



(26) 



(X- Wi,X- Vir2,...,X- VlTfcj 



This family of transformations includes variable subset selection methods, 
where each projection Wj selects one of the original variables. It also includes 
wavelet methods and our proposed treelet transform. Since an orthonormal 
projection of a Gaussian noise vector in MP is a Gaussian vector in R^, and 
a relation similar to equation (15) holds between Tx and y, formula (25) 
still holds, but with the original dimension p replaced by k, and with Vy 
replaced by its projection Tvy, 



(27) my-yf} 



1 + ^ + 

n 



C2 cr" 



k^ 



^i||TV,y 



Equation (27) indicates that a dimensionality reduction scheme should ide- 
ally preserve the net analyte signal of y (||Tvj^|| ~ ||vj^||), while at the same 
time represent the data by as few features as possible {k <^p). 

The main problem of PGA is that it optimally fits the noisy data, yielding 
for the noise- free response llT^VyH/Uvj^H ~ (1 — ca'^p^ /v?). The main limita- 
tion of variable subset selection schemes is that in complex settings with 
overlapping vectors Vj, such schemes may at best yield ll^Vy ||/||vy || < 1. Due 
to high dimensionality, the latter methods may still achieve better prediction 
errors than methods that use all the original variables. However, with a more 
general variable transformation/compression method, one could potentially 
better capture the NAS. If the data x are a priori known to be smooth con- 
tinuous signals, a reasonable choice is wavelet compression, which is known 
to be asymptotically optimal. In the case of unstructured data, we propose 
to use treelets. 

To illustrate these points, we revisit Example 3 in Section 4.1, and com- 
pare treelets to the variable subset selection scheme of Bair et al. (2006) for 
PLS, as well as global PLS on all variables. As before, we consider a rela- 
tively small training set of size n = 100 but here we include 1500 additional 
noise variables, so that p = 2000 ^ n. We furthermore assume that the re- 
sponse is given hy y = 2ui. The vectors Vj are shown in Figure 3 (top left). 
The two vectors vi and V2 overlap, but vi (associated with the response) 
and V3 are orthogonal. Therefore, the response vector unique to y (the net 
analyte signal) is given by equation (23); see Figure 4 (left). 
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Fig. 4. Left: The vector Vy (only the first 150 coordinates are shown as the rest are 
zero). Right: Averaged prediction errors of 20 simulation results for the methods, from top 
to bottom: PLS on all variables (blue), supervised PLS with variable selection (purple), 
PLS on treelet features (green), and PLS on projections onto the true vectors v; (red). 



To compute Vy, all the 100 first coordinates (the set Bi U 82 U B3) are 
needed. However, a feature selection scheme that chooses variables based on 
their correlation to the response will pick the first 10 coordinates and then 
the next 40, that is, only variables in the set Bi U B2 (the support of the 
loading vector vi). Variables numbered 51 to 100 (set ^3), although criti- 
cal for prediction of the response y = 2ui, are uncorrelated with it (as ui 
and U2 are uncorrelated) and are thus not chosen, even in the limit 00. 
In contrast, even in the presence of moderate noise and a relatively small 
sample size of n = 100, the treelet algorithm correctly joins together the 
subsets of variables 1-10, 11-50, 51-100 and 201-400 (i.e., variables in the 
sets Bi, 82,83,84). The rest of the variables, which contain only noise, are 
combined only at much higher levels in the treelet algorithm, as they are 
asymptotically uncorrelated. Because of this, using only coarse-grained sum 
variables in the treelet transform yields near optimal prediction errors. In 
Figure 4 (right) we plot the mean squared error of prediction (MSEP) for 
20 different simulations with prediction error computed on an independent 
test set of 500 observations. The different methods are PLS on all vari- 
ables (MSEP = 0.17), supervised PLS with variable selection as in Bair et al. 
(2006) (MSEP = 0.09), PLS on the 50 treelet features with highest variance, 
with the level of the treelet determined by leave-one-out cross validation 
(MSEP = 0.035), and finally PLS on the projection of the noisy data onto 
the true vectors Vj, assuming they were known (MSEP = 0.030). In all cases, 
the optimal number of PLS projections (latent variables) is also determined 
by leave-one-out cross validation. Due to the high dimensionality of the data, 
choosing a subset of the original variables performs better than full-variable 
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methods. However, choosing a subset of treelet features performs even bet- 
ter, yielding an almost optimal prediction error (c^/Hvyp ~ 0.03); compare 
the green and red curves in the figure. 

5. Examples. 

5.1. Hyperspectral analysis and classification of biomedical tissue. To il- 
lustrate how our method works for data with highly complex dependencies 
between variables, we use an example from hyperspectral imaging of biomed- 
ical tissue. Here we analyze a hyperspectral image of an H&E stained mi- 
croarray section of normal human colon tissue [see Angeletti et al. (2005) 
for details on the data collection method]. This is an ordered data set of 
moderate to high dimension. One scan of the tissue specimen returns a 
1024 X 1280 data cube or "hyperspectral image," where each pixel location 
contains spectral measurements at 28 known frequencies between 420 nm 
and 690 nm. These spectra give information about the chemical structure 
of the tissue. There is, however, redundancy as well as noise in the spectra. 
The challenge is to find the right coordinate system for this relatively high- 
dimensional space, and extract coordinates (features) that contain the most 
useful information about the chemicals and substances of interest. 

We consider the problem of tissue discrimination using only spectral in- 
formation. With the help of a pathologist, we manually label about 60000 
pixels of the image as belonging to three different tissue types (colon cell nu- 
clei, cytoplasm of colon cells, cytoplasm of goblet cells). Figure 5 shows the 
locations of the labeled pixels, and their tissue-specific transmission spectra. 
Figure 6 shows an example of how treelets can learn the covariance structure 
for colon cell nuclei (Tissue type 3). The method learns both the tree struc- 
ture and a basis through a series of Jacobi rotations (see top right panel) . By 
construction, the basis vectors are localized and supported on nested clusters 
in the tree (see the bottom left and top left panels). As a comparison, we 
have also computed the PCA eigenvectors. The latter vectors are global and 
involve all the original variables (see bottom right panel). 

In a similar way, we apply the treelet transform to the training data 
in a 5-fold cross-validation test on the full data set with labeled spectra: 
Using a (maximum height) treelet decomposition, we construct a basis for 
the training set in each fold. To each basis vector, we assign a discriminant 
score that quantifies how well it distinguishes spectra from two different 
tissue types. The total score for vector Wj is defined as 

(28) £M=f: E ^(#ii#^), 

j=lk=l;k^j 

where K = 3 is the number of classes, and H {p\-'^ Wpf'^ ) is the Kullback- 
Leibler distance between the estimated marginal density functions p\ and 
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PI of class-j and class-k signals, respectively, in the direction of Wi. We 
project our training data onto the K (< 28) most discriminant directions, 
and build a Gaussian classifier in this reduced feature space. This classifier 
is finally used to label the test data and to estimate the misclassification 
error rate. The left panel in Figure 7 shows the average CV error rate as 
a function of the number of local discriminant features. (As a comparison, 
we show similar results for Haar- Walsh wavelet packets and a local discrim- 
inant basis [Saito, Coifman, Geshwind and Warner (2002)] which use the 
same discriminant score to search through a library of orthonormal wavelet 
bases.) The straight line represents the error rate if we apply a Gaussian 
classifier directly to the 28 components in the original coordinate system. 
The key point is that, with 3 treelet features, we get the same performance 
as if we used all the original data. Using more treelet features yields an even 
lower misclassification rate. (Because of the large sample size, the curse of 
dimensionality is not noticeable for < 15 features.) These results indicate 
that a treelet representation has advantages beyond the obvious benefits 
of a dimensionality reduction. We are effectively "denoising" the data by 
changing our coordinate system and discarding irrelevant coordinates. The 
right panel in Figure 7 shows the three most discriminant treelet vectors for 
the full data set. These vectors resemble continuous- valued versions of the 
indicator functions in Section 3.2. Projecting onto one of these vectors has 
the effect of first taking a weighted average of adjacent spectral bands, and 
then computing a difference between averages of bands in different regions 




Fig. 5. Left: Microscopic image of a cross-section of colon tissue. At each pixel po- 
sition, the spectral characteristics of the tissue is measured at 28 different wavelengths 
('A = 420, 430, . . . , 690 nm). For our analysis, we manually label about 60000 individual 
spectra: Red marks the locations of spectra of "Tissue type 1" (nuclei), green "Tissue type 
2" (cytoplasm of colon cells), and blue corresponds to samples of "Tissue type 3" (cyto- 
plasm of goblet cells). Right: Spectral signatures of the 3 different tissue types. Each plot 
shows the sample mean and standard deviation of the log-transmission spectra. 
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Fig. 6. Top left: Learned tree structure for nuclei (Tissue Type 1). In the dendrogram 
the height of each U-shaped line represents the distance dij — (1 — pij)/2, where pij is the 
correlation coefficient for the two variables combined. The leaf nodes represent the p = 28 
original spectral bands. Top right: 2D scatter plots of the data at levels £ = 1, . . . ,p — 1. 
Each plot shows 500 randomly chosen data points; the lines indicate the first principal 
directions and rotations relative to the variables that are combined. (Note that a Haar 
wavelet corresponds to a fixed 7r/4 rotation.) Bottom left: Learned orthonormal basis. Each 
row represents a localized vector, supported on a cluster in the hierarchical tree. Bottom 
right: Basis computed by a global eigenvector analysis (PCA). 



of the spectrum. (In Section 5.3, Figure 10, we will see another example that 
the loadings themselves contain information about structure in data.) 

5.2. A classification example with an internet advertisement data set. 
Here we study an internet advertisement data set from the UCI ML repos- 
itory [Kushmerick (1999)]. This is an example of an unordered data set of 
high dimension where many variables are collinear. After removal of the first 
three continuous variables, this set contains 1555 binary variables and 3279 
observations, labeled as belonging to one of two classes. The goal is to pre- 
dict whether a new observation (an image in an internet page) is an internet 
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Fig. 7. Left: Average misclassification rate (in a 5-fold cross-validation test) as a func- 
tion of the number of top discriminant features retained, for a treelet decomposition (rings), 
and for Haar-Walsh wavelet packets (crosses). The constant level around 2.5% indicates 
the performance of a classifier directly applied to the 28 components in the original co- 
ordinate system. Right: The top 3 local discriminant basis (LDB) vectors in a treelet 
decomposition of the full data set. 



Table 1 

Classification test errors for an internet advertisement data set 



Classifier 


Full data set 


Reduced data set 


Final representation with 




(1555 variables) 


(760 variables) 


coarse-grained treelet features 


LDA 


5.5% 


5.1% 


4.5% 


1-NN 


4.0% 


4.0% 


3.7% 



advertisement or not, given values of its 1555 variables (various features of 
the image). 

With standard classification algorithms, one can easily obtain a general- 
ization error of about 5%. The first column in Table 1, labeled "full data 
set," shows the misclassification rate for linear discriminant analysis (LDA) 
(with the additional assumption of a diagonal covariance matrix), and for 
1-nearest neighbor (1-NN) classification. The average is taken over 25 ran- 
domly selected training and test sets, with 3100 and 179 observations each. 

The internet-ad data set has several distinctive properties that are clearly 
revealed by an analysis with treelets: First of all, several of the original vari- 
ables are exactly linearly related. As the data are binary (—1 or 1), these 
variables are either identical or of opposite values. In fact, one can reduce the 
dimensionality of the data from 1555 to 760 without loss of information. The 
second column in the table labeled "reduced data set" shows the decrease 
in error rate after a lossless compression where we have simply removed 
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Correlation Matrix. 200 Original Variables First 2O0 ordered variables 




50 100 150 200 50 100 150 200 



Fig. 8. Left: The correlation matrix of the first 200 out of 760 variables in the order 
they were originally given. Right: The corresponding matrix, after sorting all variables 
according to the order in which they are combined by the treelet algorithm. 



redundant variables. Furthermore, of these remaining 760 variables, many 
are highly related, with subsets of similar variables. The treelet algorithm 
automatically identifies these groups, as the algorithm reorders the variables 
during the basis computation, encoding the information in such a group with 
a coarse-grained sum variable and difference variables for the residuals. Fig- 
ure 8, left, shows the correlation matrix of the first 200 out of 760 variables 
in the order they are given. To the right, we see the corresponding matrix, 
after sorting all variables according to the order in which they are combined 
by the treelet algorithm. Note how the (previously hidden) block structures 
"pop out." 

A more detailed analysis of the reduced data set with 760 variables shows 
that there are more than 200 distinct pairs of variables with a correlation 
coefficient larger than 0.95. Not surprisingly, as shown in the right column 
of Table 1, treelets can further increase the predictive performance on this 
data set, yielding results competitive with other feature selection methods 
in the literature [Zhao and Liu (2007)]. All results in Table 1 are averaged 
over 25 different simulations. As in Section 4.2, the results are achieved at a 
level L < p — 1, by projecting the data onto the treelet scaling functions, that 
is, by only using coarse-grained sum variables. The height L of the tree is 
found by 10-fold cross-validation and a minimum prediction error criterion. 

5.3. Classification and analysis of DNA microarray data. We conclude 
with an application to DNA microarray data. In the analysis of gene ex- 
pression, many methods first identify groups of highly correlated variables 
and then choose a few representative genes for each group (a so-called gene 
signature). The treelet method also identifies subsets of genes that exhibit 
similar expression patterns, but in contrast, replaces each such localized 
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group by a linear combination that encodes the information from ah vari- 
ables in that group. As illustrated in previous examples in the paper, such 
a representation typically regularizes the data which improves the perfor- 
mance of regression and classification algorithms. 

Another advantage is that the treelet method yields a multi-scale data 
representation well-suited for the application. The benefits of hierarchical 
clustering in exploring and visualizing microarray data are well recognized 
in the field [Eisen et al. (1998), Tibshirani et al. (1999)]. It is, for example, 
known that a hierarchical clustering (or dendrogram) of genes can sometimes 
reveal interesting clusters of genes worth further investigation. Similarly, a 
dendrogram of samples may identify cases with similar medical conditions. 
The treelet algorithm automatically yields such a re-arrangement and inter- 
pretation of the data. It also provides an orthogonal basis for data represen- 
tation and compression. 

We illustrate our method on the leukemia data set of Golub et al. (1999). 
This data monitor expression levels for 7129 genes and 72 patients, suf- 
fering from acute lymphoblastic leukemia (ALL, 47 cases) or acute myeloid 
leukemia (AML, 25 cases). The data are known to have a low intrinsic dimen- 
sionality, with groups of genes having similar expression patterns 
across samples (cell lines). The full data set is available at 
http://www.genome.wi.mit.edu/MPR, and includes a training set of 38 
samples and a test set of 34 samples. 

Prior to analysis, we use a standard two-sample t-test to select genes that 
are differentially expressed in the two leukemia types. Using the training 
data, we perform a full (i.e., maximum height) treelet decomposition of the 
p = 1000 most "significant" genes. We sort the treelets according to their 
energy content [equation (5)] on the training samples, and project the test 
data onto the K treelets with the highest energy score. The reduced data 
representation of each sample (from p genes to K features) is finally used to 
classify the samples into the two leukemia types, ALL or AML. We examine 
two different classification schemes: 

In the first case, we apply a linear Gaussian classifier (LDA). As in Sec- 
tion 5.2, the treelet transform serves as a feature extraction and dimen- 
sionality reduction tool prior to classification. The appropriate value of the 
dimension K is chosen by 10-fold cross-validation (CV). We divide the train- 
ing set at random into 10 approximately equal-size parts, perform a separate 
t-test in each fold, and choose the K-value that leads to the smallest CV 
classification error (Figure 9, left). 

In the second case, we classify the data using a novel two-way treelet 
decomposition scheme: we first compute treelets on the genes, then we com- 
pute treelets on the samples. As before, each sample (patient) is represented 
by K treelet features instead of the p original genes. The dimension K is 
chosen by cross-validation on the training set. However, instead of applying 
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a standard classifier, we construct treelets on the samples using the new 
patient profiles. The two main branches of the associated dendrogram di- 
vide the samples into two classes, which are labeled using the training data 
and a majority vote. Such a two-way decomposition — of both genes and 
samples — leads to classification results competitive with other algorithms; 
see Figure 9, right, and Table 2 for a comparison with benchmark results in 




Fig. 9. Number of misclassified cases as a function of the number of treelet features. Left: 
LDA on treelet features; ten-fold cross-validation gives the lowest misclassification rate 
(2/38) for K = 3 treelets; the test error rate is then 3/34. Right: Two-way decomposition 
of both genes and samples; the lowest CV misclassification rate (Q/3S) is for K = 4; the 
test error rate is then 1/34. 
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Fig. 10. Left, the gene expression data with rows (genes) and columns (samples) ordered 
according to a hierarchical two-way clustering with treelets. (For display purposes, the 
expression levels for each gene are here normalized across the samples to zero mean and 
unit standard deviation.) Right, the three maximum energy treelets on ordered samples. 
The loadings of the highest- energy treelet (red) is a good predictor of the true labels (blue 
circles ) . 
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Table 2 

Leukemia misclassification rates; courtesy of Zou and Hastie (2005) 



Method 


Ten-fold CV error 


Test error 


Golub et al. (1999) 


3/38 


4/34 


Support vector machines (Guyon et al. (2002)) 


2/38 


1/34 


Nearest shrunken centroids (Tibshirani et al. (2002)) 


2/38 


2/34 


Penalized logistic regression (Zhu and Hastie (2004)) 


2/38 


1/34 


Elastic nets (Zou and Hastie (2005)) 


3/38 


0/34 


LDA on treelet features 


2/38 


3/34 


Two-way treelet decomposition 


0/38 


1/34 



Zou and Hastie (2005). Moreover, the proposed method returns orthogonal 
functions with continuous-valued information on hierarchical groupings of 
genes or samples. 

Figure 10 (left) displays the original microarray data, with rows (genes) 
and columns (samples) ordered according to a hierarchical two-way cluster- 
ing with treelets. The graph to the right shows the three maximum energy 
treelets on ordered samples. Note that the loadings are small for the two 
cases that are misclassified. In particular, "Treelet 2" is a good "continuous- 
valued" indicator function of the true classes. The results for the treelets on 
genes are similar. The key point is that whenever there is a group of highly 
correlated variables (genes or samples), the algorithm tends to choose a 
coarse-grained variable for that whole group (see, e.g., "Treelet 3" in the 
figure). The weighting is adaptive, with loadings that reflect the complex 
internal data structure. 

6. Conclusions. In the paper we described a variety of situations where 
the treelet transform outperforms PCA and some common variable selec- 
tion methods. The method is especially useful as a feature extraction and 
regularization method in situations where variables are collinear and/or the 
data is noisy with the number of variables, p, far exceeding the number of 
observations, n. The algorithm is fully adaptive, and returns both a hierar- 
chical tree and loading functions that reflect the internal localized structure 
of the data. We showed that, for a covariance model with block structure, 
the maximum energy treelets converge to a solution where they are constant 
on each set of indistinguishable variables. Furthermore, the convergence rate 
of treelets is considerably faster than PCA, with the required sample size 
for consistency being n ^ Oilogp) instead of n ^ 0{p). Finally, we demon- 
strated the applicability of treelets on several real data sets with highly 
complex dependencies of variables. 



TREELETS 



33 



APPENDIX 

A.l. Proof of Theorem 1. Let x = (xi , . . . ,Xp)'^ he a random vector with 
distribution F and covariance matrix S = Sj? . Let pij denote the correlation 
between Xj and xj. Let x^, . . . , x" be a sample from F, and denote the sample 
covariance matrix and sample correlations by S and pij. Let Sp denote all 
p X p covariance matrices. Let 

Tn{b) = < F -.T^F is positive definite, min aj >b>. 

Any of the assumptions (Ala), (Alb), or (Ale) are sufficient to guarantee 
certain exponential inequalities. 



Lemma A.l. There exist positive constants ci,C2 such that, for every 
e>0, 

(29) F{\\tjk - Sjfclloo > e) < ciple-^'^'\ 

Hence, 



-'jk ~ ^jfclloo — Op 



'logn 



n 



Proof. Under (Al), (29) is an immediate consequence of standard ex- 
ponential inequalities and the union bound. The last statement follows by 
setting en = K^/logn/n for sufficiently large K and applying (A2). □ 

Lemma A. 2. Assume either that (i) x is multivariate normal or that 
(ii) maxi<j<p \xj\ < B for some finite B and miuj aj >b > 0. Then, there 
exist positive constants cs,Ci such that, for every e>0. 



(30) P(^max \pjk - Pjk\ > ej < csp^-'""'''. 

Proof. Under normality, this follows from Kalisch and Biihlmann (2007). 
Under (ii) note that /i(o-i, (T2, 0-12) = ci2/(c"ic"2) satisfies 

^ ' ' ' 3max{|o-i 1^2- tT2l>i2- 0-12!} 
|/i((Ti, 0-2, 0-12) - /i(cri, 0-2, 0-12)1 < ^2 • 

The result then follows from the previous lemma. □ 



Let Jq denote the 2x2 rotation matrix of angle 6. Let 

_/cos(0(S)) -sin(0(S))\ 
^■^^ '^^ ~ {sm{9{J:)) cos(0(S)) J 
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denote the Jacobi rotation where 



(32) 6'(S) = -tan 



1 2Si2 



^11 — ^22 



Lemma A. 3. Let F be a bivariate distribution with 2x2 covariance 
matrix S. Let J = Js and J = J^. Then, 

(33) P(|| J^SJ - J^SJlloo > e) < C5p2e-"^6^'. 

Proof. Note that ^(S) a bounded, uniformly continuous function of S. 
Similarly, the entries of Jg are also bounded, uniformly continuous functions 
of Ti. The result then follows from (29). □ 

For any pair (a, /3), let 6'(a, /3) denote the angle of the principal component 
rotation and let J {a, (3,6) denote the Jacobi rotation on {a, (3). Define the 
selection operator 

A:Sp^{{j,k):l<j <k<p} 

by A(S) = (a,/3) where Pa^p = argmaXj^pjj. In case of ties, define A(S) 
to be the set of pairs (a,/3) at which the maximum occurs. Hence, A is 
multivalued on a subset S* C Sp of measure 0. The one-step treelet operator 
T : Sp ^ Sp is defined by 

(34) r(S) = { J^S J : J = J{a, l3, e{a, /?)), (a, /3) G A(S)}. 
Formally, T is a multivalued map because of potential ties. 

Proof of Theorem 1. The proof is immediate from the lemmas. 
For the matrices Tin-, we have that \\T,n — 5]||oo < except on a set 
of probability tending to at rate 0{n~^^~'^'^^). Hence, on the set An = 
: ^ — Sniloo < ^n}, we have that T(S.„) G Tn{Y!,). The same holds at 
each step. □ 

A. 2. Proof of Lemma 1. Consider first the case where at each level in 
the tree the treelet operator combines a coarse-grained variable with a sin- 
gleton according to {{xi,X2},X3}, — Let sq = xi. For £=l, the 2x2 co- 
variance submatrix S^'^^ =V{(so,X2)} =o'i(i i)- A principal component 
analysis of S*-^-* gives 9i = ir/A and si = '^{xi + X2)- By induction, for 
l<^<p-l, S(^-i) =V{(s£_i,x^+i)} = af(^ PCAonS(^-i) gives 

the (unconstrained) rotation angle 9£ = arctan ^/I, and the new sum variable 
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More generally, at level I of the tree, the treelet operator combines two 
sum variables u=^ T,ieAu ^» ^^^^ v = ^ EjeA. ^i' where X, A ^ {1, . . . ,p} 
denote two disjoint index subsets with m = \Au\ and n = \Av\ number of 
terms, respectively. The 2x2 covariance submatrix 

(35) J:^'-'^^Y{{u,v)} = al( 

The correlation coefficient puv = 1 for any pair {u,v); thus, the treelet op- 
erator T£ is a multivariate function of S. A principal component analy- 
sis of gives the eigenvalues Ai = m + n, A2 = 0, and eigenvectors 
^1 = ^2 = 7=^(-V^' yMV- The rotation angle 

ITl 

(36) Oi = arctan 



The new sum and difference variables at level i are given by 

S£ = ^ {+\/mu + ^/nv) 
\Jm + n 



(37) 



de = {—y/nu + \fmv) 

+ n 



y/m + n 

The results of the lemma follow. 

A. 3. Proof of Theorem 2. Assume that variables from different blocks 
have not been merged for levels i' < i, where 1 < I <p. Prom Lemma 1, 
we then know that any two sum variables at the preceding level 1—1 have 
the general form u = J2ieAu ^^^'^ ^ ~ ^ieX ^i' where Au and Av 
are two disjoint index subsets with m = \Au\ and n = \Av\ number of terms, 
respectively. Let 5k = (r/ak- 

If C Bi and Av C Bj, where i / j, that is, the subsets belong to different 
blocks, then 

(3S) v((„,„))^(;^^._^ ^•')^^'- 

The corresponding "between-block" correlation coefficient 



(^-1) o-,:,- vmn . an 



(39) pI-'>= ^ ^ < 
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with equality ("worst-case scenario") if and only if a = 0. 

If Au,Av C Bk, that is, the subsets belong to the same block, then 

(40) s(^^^)=V{(n,f)} = 

The corresponding "within-block" correlation coefficient 



(^-1) 
Pw 



1 



(41) 



> 



y^l + (m + n)/{'mn)5l + (l/(mn))(5^ 
1 



l + 3max(52,54) 



with the "worst-case scenario" occurring when m = n = 1, that is, when 
singletons are combined. Finally, the main result of the theorem follows 
from the bounds in Equations (39) and (41), and the fact that 

(42) max/3^ <mmp^y 

for (. = 1,2, . . . ,p — K is a sufficient condition for not combining variables 
from different blocks. If the inequality equation (13) is satisfied, then the 
coefficients in the treelet expansion have the general form in equation (37) 
at any level ^ of the tree. With white noise added, the expansion coefficients 
have variances N{si} = {m + n)al + a"^ and ^{de} = a"^ ■ Further- 

more, E{s£} = E{(i£} = 0. 
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